add SQL adapter #779

Merged
merged 54 commits into bluesky:main from add_sql_adapter on Feb 14, 2025

Conversation

skarakuzu
Contributor

@skarakuzu skarakuzu commented Aug 22, 2024

preliminary start of sql adapter. to be continued ...

Checklist

  • Add a Changelog entry
  • Add the ticket number which this PR closes to the comment section

@danielballan
Member

Lifecycle:

  1. The client declares that it wants to create a new tabular dataset, via a request POST /api/v1/metadata/my_table.
  2. In the "catalog" SQL database, the server adds a row to the nodes table with any metadata about this table. This is how the new table is connected to any overall dataset, like a Bluesky scan and its Scan ID.
  3. Also in the "catalog" SQL database, the server adds a row each to the data_sources table and the assets table. Together, they describe how to locate where the new data will be saved. The Asset part is very locked down: it has room for the URI of the tabular SQL database (postgresql://...) and some boilerplate. The DataSource has a freeform area called parameters, which can hold any JSON. We can use this for dataset-specific details, like the name of the SQL table (table_name)---derived from the Arrow schema in this case---and a means of selecting the rows of interest for this new dataset (dataset_id).
  4. When data is written or read, a SQLAdapter object is instantiated inside the server. It is passed information extracted from this DataSource and Asset, so it knows the table_name and the dataset_id (sketched below).
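
A rough sketch of the DataSource/Asset records described in steps 3 and 4. The field names and values here are illustrative assumptions, not the exact catalog schema:

# Hypothetical shape of the DataSource/Asset pair described above.
asset = {
    "data_uri": "postgresql://...",  # URI of the tabular SQL database, plus boilerplate
}
data_source = {
    "mimetype": "application/x-tiled-sql-table",
    "parameters": {  # freeform JSON area
        "table_name": "table_blahblahblah",  # derived from the Arrow schema
        "dataset_id": 12345,                 # selects the rows belonging to this dataset
    },
    "assets": [asset],
}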

@danielballan
Member

danielballan commented Dec 13, 2024

Test script:

import pandas
from tiled.client import from_uri
from tiled.structures.core import StructureFamily
from tiled.structures.data_source import Asset, DataSource, Management
from tiled.structures.table import TableStructure

client = from_uri("http://localhost:8000", api_key="secret")

df = pandas.DataFrame({"a": [1, 2, 3], "b": [1., 2., 3.]})
structure = TableStructure.from_pandas(df)

x = client.new(
    structure_family=StructureFamily.table,
    data_sources=[
        DataSource(
            management=Management.writable,
            mimetype="application/x-tiled-sql-table",
            structure_family=StructureFamily.table,
            structure=structure,
            assets=[],
        ),
    ],
    metadata={},
    specs=[],
    key="x",
)
# x.write(df)  # superseded by append_partition below
x.append_partition(df, 0)

# This does not work yet
# x.read()  # calls /table/partition/x?partition=0 adapter.read_partition()

@skarakuzu skarakuzu force-pushed the add_sql_adapter branch 2 times, most recently from 7913020 to 75b2ddc on January 15, 2025 17:56
@danielballan
Member

danielballan commented Jan 16, 2025

For this PR

  • Add dataset_id column and filter by it.
  • Create table eagerly.
  • In Adapter, remove write. Write would mean "overwrite" or "replace" and we are not sure we want to expose this. (We can add it later if we want it.)
  • In client, replace write_appendable_dataframe with create_appendable_dataframe. This will run the self.new(...) call, which runs init_storage on the server side, but it will not take any data. Data will be appended in later calls.
  • In Adapter, I removed append and used append_partition. (For now it's stuck at partition=0 but this constraint will be temporary.) Tests need to be updated.
  • Execute CREATE INDEX IF NOT EXISTS ... on the dataset_id column.
  • Pandas indexes should round-trip. (Dan)
  • Protect against SQL injection. In init_storage, table_name should match some restrictive regex pattern. Maybe lowercase letters, numbers, and underscores? (See the sketch after this list.)
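
One possible shape for that table_name check. The exact pattern and function name are assumptions; the item above only asks for "some restrictive regex":

import re

# Assumed pattern: lowercase letters, digits, and underscores, starting with
# a letter. The real rule is whatever the PR finally settles on.
TABLE_NAME_PATTERN = re.compile(r"^[a-z][a-z0-9_]*$")

def validate_table_name(table_name: str) -> str:
    "Reject table names that could smuggle SQL into interpolated queries."
    if not TABLE_NAME_PATTERN.match(table_name):
        raise ValueError(f"Invalid table name: {table_name!r}")
    return table_name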

Intended usage now looks like...

The following prompts the server to:

  1. Generate a table_name from the schema hash; see the hashing sketch below. (The table might or might not already exist, containing rows from other dataset_ids.)
  2. Generate a new unique dataset_id for this dataset.
  3. Store the table_name, dataset_id, and any metadata passed here in the catalog database.
# This uploads no data.
x = client.create_appendable_table(schema, key="x")
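
For step 1, the table_name might be derived from a hash of the Arrow schema along these lines. This is a sketch only; the hash function and naming convention are assumptions:

import hashlib

import pyarrow

def table_name_from_schema(schema: pyarrow.Schema) -> str:
    # Hash a serialized form of the Arrow schema so that datasets with
    # identical schemas land in the same SQL table.
    digest = hashlib.md5(schema.serialize().to_pybytes()).hexdigest()
    return f"table_{digest}"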

The following prompts the server to:

  1. Create the table {table_name} if it does not yet exist.
  2. Ingest the rows into that table, with an additional dataset_id column (a rough SQL sketch follows below).
# Now data can be added, potentially in parallel.
x.append_partition(df, 0)
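
Roughly, the statements behind that append might look like the following, using the columns from the example df above. The exact DDL, column types, and index name are assumptions, not what the adapter literally emits:

# Hypothetical SQL issued by the adapter on (first) append; illustrative only.
CREATE_TABLE = """
CREATE TABLE IF NOT EXISTS table_blahblahblah (
    dataset_id INTEGER,
    a BIGINT,
    b DOUBLE PRECISION
)
"""
# Index name below is hypothetical.
CREATE_INDEX = (
    "CREATE INDEX IF NOT EXISTS table_blahblahblah_dataset_id "
    "ON table_blahblahblah (dataset_id)"
)
INSERT_ROWS = (
    "INSERT INTO table_blahblahblah (dataset_id, a, b) "
    "VALUES (:dataset_id, :a, :b)"
)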

In a separate process, this would also work. We can access an existing table and keep appending.

x = client["x"]
x.append_partition(df, 0)

In follow-up PRs...

  • Support PG database with credentials.
  • Connection pooling
  • Support more than one partition. SQL will scale fine to a large table, but current Tiled does not let the client request less than a full partition. We either need to change that and let users request row ranges (seems complicated, especially with Parquet... so I think it might be something to wait on) or mark up the data in the SQL table as belonging to reasonably-sized partitions. Similar to how arrays are chunked by the client, table rows should be partitioned.

Maybe in the future partitions will be added like this? Not sure whether PostgreSQL's native "table partitioning" fits our use case.

# table_blahblahblah
dataset_id  partition_id  ...
12345       1
12345       1
12345       2
12345       3
12345       3
12345       3
24323
def read_partition(self, partition):
    query = f"SELECT * FROM {self.table_name} WHERE dataset_id={self.dataset_id} AND partition={partition}"
    ...
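
Per the SQL-injection item above, the row filters would presumably be bound parameters rather than interpolated into the string, e.g. (sketch only):

def read_partition(self, partition):
    # table_name was validated against a restrictive pattern at init_storage
    # time; dataset_id and partition are passed as bound parameters.
    query = (
        f"SELECT * FROM {self.table_name} "
        "WHERE dataset_id = :dataset_id AND partition = :partition"
    )
    params = {"dataset_id": self.dataset_id, "partition": partition}
    ...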

@danielballan danielballan force-pushed the add_sql_adapter branch 2 times, most recently from 60c6f2c to 0c0473e on February 12, 2025 15:30
Seher Karakuzu and others added 11 commits February 12, 2025 10:33
preliminary start of sql adapter. to be continued ...

hashed table names. to be continued...

modified hashing and added a test for sqlite database. to be continued

try TILED_TEST_POSTGRESQL_URI usage

fix postgreql uri

Automatically set SQL driver if unset.

Do not require env var to be set.

Consistently use database URI with schema.

Refactor init_storage interface for SQL.

More adapters updated

More adapters updated

Parse uri earlier.

Use dataclass version of DataSource.

Begin to update SQLAdapter.

Fix import

Typesafe accessor for Storage

few changes

Basic write and append works

Do not preserve index.

changes in test_sql.py

latest changes

tried to fix the tests

removed prints

Remove vestigial comment.

Extract str path from sqlite URI

Use unique temp dir and clean it up.

some more fixing and addition of partitions

fixing docstrings

CLI works with SQL writing

Tests pass again

Add convenience method write_appendable_dataframe.

Fix typo

Fix path handling for Windows

The dataset_id concept is mostly implemented

Fix conditional

Support appendable tables with --temp catalog

Revert order swap (for now)
@danielballan danielballan merged commit 819d194 into bluesky:main Feb 14, 2025
8 checks passed
Review thread on def append_partition in the SQL adapter:
Contributor

Just trying to use this adapter to write tabular data. Is there a reason it only has the append_partition method, not write_partition or write like the rest of the writable adapters (at least the CSV and Parquet adapters seem to have them)? I think it would be nice to unify and standardize the interfaces of all writable adapters, similarly to how we have done this with the .from_catalog and .from_uris methods, to make it less confusing for the user. @skarakuzu

Contributor Author

It was implemented initially but removed later. If I remember correctly, it is because write or write_partition would mean deleting the existing table and starting a new one, and we do not want to delete the tables.

Contributor

Ah, I see. Thank you for explaining this, Seher.
